Introduction to Common Model Selection Critera

Model specification is an element of statistics whose importance is often glossed over. In most cases, there are multiple appropriate models for a set of data. The end choice of model can hugely impact the results, and classical methods offer limited guidance on the best process for accounting for the uncertainty this creates. Unstructured searches and checks for the best model specification can lead to incorrect inferences, fragile reported findings, and publication bias [@Montgomery]. Frequentist approaches to model selection include but are not limited to general using the Akaike Information Criterion (AIC) and using the root mean squared error (RMSE) or R^2. We will thus elaborate a little on AIC and two other common information criterion, BIC and WAIC.

Information Criterion

Information Criterion are commonly used to select the ‘best’ model from a set or to understand how much better one model is compared to another. They can be thought of as trying to estimate how good the model is at fitting out of sample data, when no such data is available. All of these criterion consist of a likelihood component, which estimates how good the model is at fitting within sample data, and a bias corrector, to account for how models are usually better at fitting within sample data than out of sample data. For all information criterion, a lower score is better. They can only be directly compared to themselves.

Akaike Information Criterion

The Akaike Information Criterion (AIC) is the simplest information criterion, and the most commonly used:

\[AIC = -2 ln(L) + 2k\]

Where \(L\) is the model’s likehood, and \(k\) is the number of parameters in the model. Because more parameters allow a model to fit better to within sample data, it is used as the bias corrector.

Given multiple models \(i\), you can estimate the probability a model \(k\) minimizes information loss as such, where \(AIC_{min}\) is the lowest score in the set of models.

\[P_k = exp\left(\frac{AIC_{min} - AIC_k}{2}\right)\]

AIC fits well in the freqentist frame. The likelihood component is assumed to come from the maximum likelihood estimates of paramters, which Bayesian methods don’t select for, and point estimates for parameters are used, instead of using the full posterior distribution.

Bayesian Information Criterion

The misleadingly named Bayesian Information Criterion (BIC), is given by:

\[BIC = - 2ln(L) + k\;ln(n)\] Where \(k\) is the number of parameters in the model, \(n\) is the number of data points, and \(L\) is again the likelihood function.

BIC penalizes free parameters more than the AIC when data sets are large. While AIC tries to select the best model that describes the data presented, the BIC attempts to select the true model from among a model set. However, like AIC, BIC doesn’t naturally fit with the Bayesian framework as it is scored by the likelihood function.

Widely Applicable Information Criterion

One information criterion that fits naturally within the Bayesian framework is the Widely Applicable Information Criterion (WAIC):

\[WAIC = -2\sum_{i=1}^nlog\left(\frac{1}{S}\sum_{s=1}^S p(y_i\vert\theta^s) \right) +2\sum_{i=1}^nV_{s=1}^S(log(p(y_i\vert\theta^s)))\]

Where \(S\) is the number of posterior draws, \(\theta^s\) is the s-th posterior draw for parameters \(theta\), and \(V^S_{s=1}(a_s)\) is the sample variance of \(a_s\).

The first half of WAIC uses the posterior distribution of \(\theta\), as opposed to only likelihood estimates for theta as are used in AIC and BIC. This means that the WAIC is fully Bayesian. The second half is the bias corrector which can be thought of as the number of unconstrained parameters accounting for the complex ways Bayesian parameters interact.

Bayesian Model Averaging

Bayesian Model Averaging (BMA) offers an alternative practice that helps ensure findings are robust to a variety of model specifications. At its simplest level, BMA assigns priors to potential model specifications and then caluclates posterior distributions for the model itself, in addition to the coefficients within the specification. This is thus an extension of previous Bayesian theory that focuses solely on coefficient estimation.

Let us begin by considering a matrix \(X\) of all the \(n \times p\) potential independent variables to predict a response variable \(Y\). A standard linear analysis would assume \(Y = X \beta + \epsilon\), where \(\beta\) is a coefficient matrix and \(\epsilon\) ~ \(N(0, \sigma^2)\). There are \(q=2^p\) potential model specifications from the model space \(\{M_1, M_2, ...M_q\}\), and in many cases, there is ambiguity about which of these is best. BMA incorporates this uncertainty into the process rather than ignoring it and claiming that the final model is the only option. This leads to greater flexibility in the inferences of the end results [@Montgomery].

Each model \(M_k\) encompasses the likelihood function \(L(Y|\beta_k, M_k)\) of the observed data \(Y\) in terms of a model-specific coefficient matrix \(\beta_k\) with a prior \(\pi(\beta_k|M_k)\). Both the likelihood and priors are conditional on a particular model [@Fragoso]. The posterior distribution for the model parameters is then \[\pi(\beta_k|Y, M_k) = \frac{L(Y|\beta_k, M_k) \; \pi(\beta_k | M_k)}{\int L(Y|\beta_k, M_k) \; \pi(\beta_k | M_k) \; d\beta_k}\]

The above denominator represents the marginal distribution of the observations over all paramteter values specified in model \(M_k\). It is called the model’s marginal likelihood or model evidence and is denoted by \(\pi(Y|M_k)\). BMA now assumes a prior distribution over the model space describing the prior uncertainty over each model’s capability to accurately describe and/or predict the data. This is modeled as a probability density over the model space, with values \(\pi(M_k)\) for \(k = 1, 2, ... q\) [@Fragoso].

Then, the posterior probability of model specification \(M_k\) is \[\pi(M_k | Y) = \frac{L(M_k | Y) \; \pi(M_k)}{\sum_{k=0}^q \; \pi(Y|M_k)\; \pi(M_k)}\].

Pseudo-Bayesian Model Averaging

Traditional BMA is not without issues. For example, changing priors from one vague prior to another can significantly impact posterior model probablities. And calculating model likelihoods becomes extremely diffiuclt for anything but the simplest models.

As a result of these issues, a method called Pseudo-BMA was created. This has the same core idea as BMA, but instead of calculating model likelhoods, something called Leave-One-Out cross validation (LOO-CV) is used.

Similar to the information criterion, LOO-CV estimates how good a model is at predicting out of sample data. To do this, it fits the model on all but one data points, then checks how well it predicts the missing data point. The log of this probability is used This is done for every data point in the sample.

Put mathematically given S simulation draws , this is:

\[ \sum_{i=1}^nlog(p(y_i\vert y_1,\dots y_{i-1},y_{i+1},y_n)) \]

Where the term in the \(log\) is the probability of seeing \(y_i\) from a model trained on all the data excluding \(y_i\).

By adding a complicated bias term, this can be turned into an information criterion. Both AIC and WAIC will approach LOO-CV as sample size increases.

While exact LOO requires \(n\) iterations, one for each point in \(y\), Pseudo-BMA reduces computational complexity by taking samples from the posterior distribution. Using a technique called Pareto Smoothed Importance Sampling (PSIS), LOO-CV is estimated. This method also allows Pseudo-BMA to be used on models with parameters that have already been fitted.

Given a dataset \(y\), models \(M_k\), weights \(w_{i,k}^s\), and parameters\(\Theta\), PSIS-LOO is an efficent way to approximate the estimate of the expected log pointwise predictive density for model k, which is the measure of how good model k is at fitting out of sample data:

\[ \widehat{elpd}^k=\sum_{i=1}^n log(\hat{p} (y_i | y_1,\dots y_{i-1},y_{i+1},y_n, M_K)) \]

Where conditioning on \(M_k\) means using that model for parameter estimates.

This is similar to model probabilities using AIC, model probabilities are calculated by:

\[ w_k = \frac{exp(\widehat{elpd}^k)}{\sum_{k=1}^Kexp(\widehat{elpd}^k)} \]

In practice, one more step involving bootstrapping is used to deal with additional bias in the estimate. This final model weight \(w_k\) is the approximate probability of model k being to true model.

These model probabilities can be used in a variety of ways. For example, BMA allows for a direct combination of models to calculate combined estimations for parameters, which leads to lower risk predictions than a single model [@Fragoso]. In addition, BMA can be used in model selection by choosing the model with the highest posterior probability. We will focus on this latter application.

Implementation

library(ggplot2)
library(dplyr)
library(rstan)
library(rstanarm)
library(bayesplot)
library(loo)
library(tidyr)
library(ggridges)
library(purrr)

To illustrate Pseudo-BMA in practice, we’ve chosen the famous candy dataset from the fivethirtyeight package:

candy <- fivethirtyeight::candy_rankings
dim(candy)
## [1] 85 13
names(candy)
##  [1] "competitorname"   "chocolate"        "fruity"           "caramel"         
##  [5] "peanutyalmondy"   "nougat"           "crispedricewafer" "hard"            
##  [9] "bar"              "pluribus"         "sugarpercent"     "pricepercent"    
## [13] "winpercent"
head(candy)
## # A tibble: 6 x 13
##   competitorname chocolate fruity caramel peanutyalmondy nougat crispedricewafer
##   <chr>          <lgl>     <lgl>  <lgl>   <lgl>          <lgl>  <lgl>           
## 1 100 Grand      TRUE      FALSE  TRUE    FALSE          FALSE  TRUE            
## 2 3 Musketeers   TRUE      FALSE  FALSE   FALSE          TRUE   FALSE           
## 3 One dime       FALSE     FALSE  FALSE   FALSE          FALSE  FALSE           
## 4 One quarter    FALSE     FALSE  FALSE   FALSE          FALSE  FALSE           
## 5 Air Heads      FALSE     TRUE   FALSE   FALSE          FALSE  FALSE           
## 6 Almond Joy     TRUE      FALSE  FALSE   TRUE           FALSE  FALSE           
## # … with 6 more variables: hard <lgl>, bar <lgl>, pluribus <lgl>,
## #   sugarpercent <dbl>, pricepercent <dbl>, winpercent <dbl>
summary(candy)
##  competitorname     chocolate         fruity         caramel       
##  Length:85          Mode :logical   Mode :logical   Mode :logical  
##  Class :character   FALSE:48        FALSE:47        FALSE:71       
##  Mode  :character   TRUE :37        TRUE :38        TRUE :14       
##                                                                    
##                                                                    
##                                                                    
##  peanutyalmondy    nougat        crispedricewafer    hard        
##  Mode :logical   Mode :logical   Mode :logical    Mode :logical  
##  FALSE:71        FALSE:78        FALSE:78         FALSE:70       
##  TRUE :14        TRUE :7         TRUE :7          TRUE :15       
##                                                                  
##                                                                  
##                                                                  
##     bar           pluribus        sugarpercent     pricepercent   
##  Mode :logical   Mode :logical   Min.   :0.0110   Min.   :0.0110  
##  FALSE:64        FALSE:41        1st Qu.:0.2200   1st Qu.:0.2550  
##  TRUE :21        TRUE :44        Median :0.4650   Median :0.4650  
##                                  Mean   :0.4786   Mean   :0.4689  
##                                  3rd Qu.:0.7320   3rd Qu.:0.6510  
##                                  Max.   :0.9880   Max.   :0.9760  
##    winpercent   
##  Min.   :22.45  
##  1st Qu.:39.14  
##  Median :47.83  
##  Mean   :50.32  
##  3rd Qu.:59.86  
##  Max.   :84.18

Exploratory EDA

Plotting the top 10 candies by win percentage

candy %>%
  head(20) %>% 
  ggplot() +
  geom_col(aes(y = reorder(competitorname, winpercent), x = winpercent)) +
  labs(x = "Percent of Head to Heads Won", y = NULL, title = "Top 20 Candies") 

Plot how many candies are each type

candy %>% 
  summarise_if(is.logical, sum) %>% 
  pivot_longer(1:ncol(.), names_to = "type", values_to = "count") %>%
  mutate(type_clean = c("Chocolate", "Fruity", "Caramel", "Peanuty or Almondy", "Nougat", "Crisped Rice Wafer", "Hard", "Bar", "Several Candies in One Bag")) %>% 
  ggplot(aes(y = reorder(type_clean, count), x = count)) +
  geom_col() +
  labs(y = NULL, x = NULL, title = "Number of Candies with each Attribute")

Plot win percent by cost

candy %>% 
  ggplot(aes(x = pricepercent, y = winpercent)) +
  geom_jitter() +
  labs(x = "Relative Price", y = "Percent of Head to Heads Won", title = "Price versus Win Rate")
<<<<<<< HEAD

=======

>>>>>>> 88350af0b9304de10410dce83574f52679ec1cd5

Plot win percent by sugar

candy %>% 
  ggplot(aes(x = sugarpercent, y = winpercent)) +
  geom_jitter() +
  labs(x = "Sugar Content (Percent)", y = "Percent of Head to Heads Won", title = "Sugar Content versus Win Rate") +
  ylim(0, 100)
<<<<<<< HEAD

=======

>>>>>>> 88350af0b9304de10410dce83574f52679ec1cd5

Win percent for each candy type

candy %>% 
  select(winpercent, 2:10) %>%
  pivot_longer(2:10, names_to = "type", values_to = "is_type") %>% 
  filter(is_type) %>% 
  select(-is_type) %>% 
  group_by(type) %>% 
  mutate(mean_win = mean(winpercent)) %>% 
  ungroup() %>% 
  ggplot(aes(x = winpercent, y = reorder(type, mean_win), height = stat(density))) + 
  geom_density_ridges(stat = "binline", bins = 20, scale = 0.95, draw_baseline = FALSE) +
  labs(x = "Percent of Head to Heads Won", y = NULL, title = "Candy Type by Win Percent") + 
  scale_y_discrete(labels = rev(c("Crisped Rice Wafer", "Peanuty or Almondy", "Bar", "Chocolate", "Nougat", "Caramel", "Several Candies in One Bag", "Fruity", "Hard")))

We start by creating regular stan models. Each model differs by the number of variables considered, in this case. These posterior distributions are approximated via MCMC.

set.seed(454)
model_1 <- stan_glm(winpercent ~ sugarpercent, 
                data = candy,
                family = gaussian, 
                chains = 4, iter = 2*5000, refresh = FALSE)
model_2 <- stan_glm(winpercent ~ sugarpercent + pricepercent, 
                data = candy,
                family = gaussian, 
                chains = 4, iter = 2*5000, refresh = FALSE)
model_3 <- stan_glm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus, 
                data = candy,
                family = gaussian, 
                chains = 4, iter = 2*5000, refresh = FALSE)
model_4 <- stan_glm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + sugarpercent + pricepercent, 
                data = candy,
                family = gaussian, 
                chains = 4, iter = 2*5000, refresh = FALSE)
model_list <- list(model_1 = model_1, model_2 = model_2, model_3 = model_3, model_4 = model_4)

Lets examine the chains

mcmc_trace(model_1)

mcmc_dens_overlay(model_1)

mcmc_trace(model_2)

mcmc_dens_overlay(model_2)

mcmc_trace(model_3)

mcmc_dens_overlay(model_3)

mcmc_trace(model_4)

mcmc_dens_overlay(model_4)

Examining the stability of our models: Ssimulations are somewhat but not super stable. There are a few noticeable outliers in every model, as well as a relatively wide range of packed simulations. They seem to roughly fit, though go a little high overall.

pp_check(model_1)

pp_check(model_2)

pp_check(model_3)

pp_check(model_4)

mcmc_trace(model_1)

mcmc_trace(model_2)

mcmc_trace(model_3)

mcmc_trace(model_4)

As we are satsified, we can proceed with these models. Here are the coefficients:

model_1$coefficients
##  (Intercept) sugarpercent 
##     44.67007     11.81196
model_2$coefficients
##  (Intercept) sugarpercent pricepercent 
##    39.802262     6.722496    15.580509
model_3$coefficients
##          (Intercept)        chocolateTRUE           fruityTRUE 
##           35.2393393           19.6522822           10.0194746 
##          caramelTRUE   peanutyalmondyTRUE           nougatTRUE 
##            3.3625741            9.9863058            2.3087578 
## crispedricewaferTRUE             hardTRUE              barTRUE 
##            8.8676580           -4.8126445           -0.5306398 
##         pluribusTRUE 
##           -0.1460981
model_4$coefficients
##          (Intercept)        chocolateTRUE           fruityTRUE 
##           34.7283883           19.5438496            9.1570800 
##          caramelTRUE   peanutyalmondyTRUE           nougatTRUE 
##            2.2049288            9.9400899            0.7589168 
## crispedricewaferTRUE             hardTRUE              barTRUE 
##            8.7610339           -6.0805383            0.5256453 
##         pluribusTRUE         sugarpercent         pricepercent 
##           -0.8447881            9.1226100           -5.7884718

Model Evaluations using ELPD

Once the models are calculated, we calcuate Leave One Out expected log point predictive densities (ELPD LOO) for each model.

Calculate

"Model 1 Loo"
## [1] "Model 1 Loo"
(loo_1 <- loo(model_1))$estimates
##             Estimate         SE
## elpd_loo -349.299296  5.8136277
## p_loo       2.708551  0.5166705
## looic     698.598592 11.6272553
"Model 2 Loo"
## [1] "Model 2 Loo"
(loo_2 <- loo(model_2))$estimates
##             Estimate         SE
## elpd_loo -346.734615  6.6778576
## p_loo       4.051253  0.9045871
## looic     693.469229 13.3557153
"Model 3 Loo"
## [1] "Model 3 Loo"
(loo_3 <- loo(model_3))$estimates
##            Estimate        SE
## elpd_loo -329.91156  5.818139
## p_loo      10.57069  1.541516
## looic     659.82312 11.636278
"Model 4 Loo"
## [1] "Model 4 Loo"
(loo_4 <- loo(model_4))$estimates
##            Estimate        SE
## elpd_loo -330.25141  5.930144
## p_loo      12.93009  1.862167
## looic     660.50283 11.860288
lpd_point <- cbind(
  loo_1$pointwise[,"elpd_loo"], 
  loo_2$pointwise[,"elpd_loo"],
  loo_3$pointwise[,"elpd_loo"],
  loo_4$pointwise[,"elpd_loo"]
)

With the ELPD calculations, the models are able to be ranked in order of their contribution, as a percentage. As shown, model3 has the highest weight with 0.534, followed by model4. By this metric, model3 appears to be the best model from the set of 4 we created above.

(weights <- pseudobma_weights(lpd_point))
## Method: pseudo-BMA+ with Bayesian bootstrap
## ------
##        weight
## model1 0.001 
## model2 0.006 
## model3 0.552 
## model4 0.441

Predictions

A new type of candy is created (new_candy), which is turned into a dataframe for prediction. A table of predictions for all models is then created, titled predictions.

new_candy <- data.frame(chocolate = TRUE, fruity = TRUE, caramel = TRUE, peanutyalmondy = TRUE, nougat = TRUE, crispedricewafer = TRUE, hard = FALSE, bar = FALSE, 
                        pluribus = FALSE, sugarpercent = 0.20, pricepercent = 0.9)
my_predict <- function(model) {
  posterior_predict(model, newdata = new_candy)
}
make_df <- function(predictions, index) {
  data.frame(winpercent_new = predictions[,1], model = index)
}
predictions <- map(model_list, my_predict) %>% 
  map2(1:4, make_df)

Using the weights we calculated from using the pseudobma_weights() function to obtain ELPD scores, we can then generate the predictive distributions for the win percentage of our made up candy, defined above.

sampled_pred <- predictions %>%  #calculating sample predictions for each model 
  map2(weights, sample_frac) %>% 
  bind_rows() %>% 
  mutate(model = as.character(model))
sampled_pred %>% 
  ggplot(aes(x = winpercent_new, fill = model, color = model)) +
  geom_density_ridges(alpha = 0.2, aes(y = model)) + labs(title = 'Posterior Predictive Distributions for our Generated Candy')
## Picking joint bandwidth of 3.32

Unsurprisingly, the different models provided slightly different predictions. The \(MAP\) ‘winpercent’ for model, model2 appear to be closer to 45-50%, while the model3, model4 show a ‘winpercent’ of closer to 80%.

sampled_pred %>% 
  ggplot(aes(x = winpercent_new, fill = model, color = model)) +
  geom_histogram(alpha = 0.9, position = "stack")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As a more visual demonstration of the weights, the contribution of each model is evident in the predictive distribution, with model 3 contributing the largest, followed by model 4. Thus, it is not surprising that the pseudo-BMA model also has an \(MAP\) of around 80%, given the dominance of model3 in the weighted averaging.

Pseudo-BMA vs. Single Model Performance

As shown above, it is evident that the pseudo-BMA model’s prediction is heavily reliant on model 3 & 4, with similar \(MAP\) values of around 80% for the win percentage for the made up candy. To better understand the advantage of Pseudo-BMA, we can compare its predictions to that of model 3, the ‘best’ standalone model, using the ppc_intervals function.

Once again, unsurprisingly, the mean predictions for all 17 candies for model3 and the pseudo-BMA weighted average are fairly close. However, evident for Baby Ruth, Nestle Butterfinger, and Reese’s Peanut Butter Cup the pseudo-BMA weighted average model is slightly closer to the actual win percentage. In other cases, like Sour Patch Kids, Reese’s Miniatures, and Air Heads, model 3 appears to have better predictions.

A Comparison of Pseudo-BMA and Model 3 Accuracy

 # rbind(prediction_summary(y = sampled_candy$winpercent, 
  #yrep = colMeans(pseudo_predictions)),
  #      prediction_summary(y = sampled_candy$winpercent, 
  #yrep = candy_predictions_3))

tibble(
  model = c('Pseudo-BMA', 'Model 3'),
  mae = c(22.25264, 24.55274),
  mae_scaled = c(0.7003848,5.5119684),
  within_50 = c(0.17647059, 0.05882353),
  within_95 = c(0.2352941   , 0.1764706 )
)
## # A tibble: 2 x 5
##   model        mae mae_scaled within_50 within_95
##   <chr>      <dbl>      <dbl>     <dbl>     <dbl>
## 1 Pseudo-BMA  22.3      0.700    0.176      0.235
## 2 Model 3     24.6      5.51     0.0588     0.176

While hard to tell which is better visually, prediction_summary() plainly indicates the better model. Due to the use of a weighted average, Pseudo BMA is evidently more accurate than model 3. On average, pseudo-BMA was 2% closer to the actual win percentage of a candy than model 3, (MAE), and 23.5% of the candy’s win percentage is within the middle 95% of pseudo-BMA’s predictive distribution.

Conclusion(?)

The concept of taking a weighted average of the predictions from multiple models is not a new technique, given the obvious benefits of increased flexibility and robustness to predictions. However, weighted averages are predominantly seen in Frequentist approaches to modelling, and less so in the Bayesian approach. Bayesian Model Averaging, and more specifically Pseudo-BMA, is a more computationally efficient method to evaluate each model’s predictive distribution, assigning ELPD scores for each model. ELPD, as fractional indication of model weight, allows for predictions from each model to be combined in a manner that provides the optimal predictive distribution, a Pseudo-BMA model. As shown by the candy dataset, Pseudo-BMA naturally outperforms the best single-model calculated, and could only improve with even more models tested.

Appendix Section:

PPC Interval Graphs for models 1,2,4:

Models graphed in that order:

candy_plotter(model_1)

candy_plotter(model_2)

candy_plotter(model_4)